Detection of false-positive motifs is one of the main causes of low performance in motif-finding
methods. It is generally assumed that false-positives are mostly due to algorithmic weaknesses of
motif-finders [1-3]. Here, however, we derive the theoretical dependence of false positives on
dataset size and find that false positives can arise as a result of large dataset size, irrespective
of the algorithm used. Interestingly, the false-positive strength depends more on the number of
sequences in the dataset than it does on the sequence length. As expected, false-positives can be
reduced by decreasing the sequence length or by adding more sequences to the dataset. The
dependence on the number of sequences, however, diminishes and reaches a plateau, after which
adding more sequences to the dataset does not reduce the false-positive rate significantly. Based
on the theoretical results presented here, we provide a number of intuitive rules of thumb that
may be used to enhance motif-finding results in practice.

Introduction

Because binding of sequence-specific transcription factors to their recognition sites in non-coding
DNA is an important step in the control of gene expression, the development of computational
methods to identify transcription factor binding motifs in non-coding DNA has received much
attention in computational biology. The low information content of transcription factor binding
motifs implies difficulty for computational analyses. For example, given a known binding motif,
identification of bona fide examples is always plagued by false positives (the so-called Futility
Theorem [4]).

An even more challenging computational problem is the de novo identification of transcription
factor binding motifs (so-called motif-finding), for which there are many available tools (for
tutorials on different methods see [5, 6] and references therein). Despite the substantial
algorithm development effort in this area, recent comprehensive benchmark studies [1-3] revealed
that the performance of DNA motif-finders leaves room for improvement in realistic scenarios,
where known transcription factor binding sites have been planted in test sequence sets. One of the
major problems is that DNA motif-finders can identify seemingly strong candidate motifs, even when
randomly chosen sequences are provided as the input. This has led to simulation-based approaches
to identify the bona fide motifs, where the motif-finding algorithm is run several times on random
data and the p-value is computed as the fraction of motifs with better scores than the motif
identified in real data [7]. While feasible for expert computational biologists, this approach
requires significant computational resources and is not practical for most biological users.

We argue that part of the low performance of motif-finding algorithms is due to the statistical
nature of large sequence datasets: when the dataset is large enough, any structure can occur
by chance. We formalize this idea using information theory to obtain a remarkably simple
analytical relationship between the size of the sequence search space and the strength of the
false-positive motifs. Interestingly, our analysis shows that for biologically realistic dataset
sizes and motif strengths, false positives as strong as real transcription factor binding sites
are quite likely to arise. This represents an extension of the "Futility Theorem" [4] to the de
novo motif-finding problem.

Results 

Motif-finders are expected to find strong signals in random DNA sequences: We 
represent patterns in DNA sequence families (called motifs) as probability matrices, where 
each column specifies the distribution of the DNA letters. The underlying idea here is to 
quantify the probability of observing motifs of a certain strength in a set of random sequences 
using large-deviations theory. Suppose that the set of all motifs, $\mathcal{X}$, is generated according
to a random nucleotide background distribution g (for instance, g can be the genome-wide
distribution of nucleotides). It is expected that all nucleotides in motifs will have frequencies
close to those in g. Therefore, motifs that have a distribution significantly different from g (i.e.
the false-positives in our case) are considered rare events that are far from expectation.
We use large-deviations theory, in particular Sanov's theorem [8], to measure the probability
of these rare events. We then derive the expected size of the set $\mathcal{X}$ above which the observation
of strong motifs becomes likely to be due to chance.

Let a DNA motif with W columns have a distribution or probability matrix f (see Fig.
1 and Methods for the definition of the motif-finding problem parameters). The difference between
the distribution of the motif, f, and the background distribution, g, is measured using the
Kullback-Leibler (KL) divergence [8], also known as the biological information content [9, 10],
defined as follows:

$$ D(f,g) = I_{seq}(f,g) = \sum_{j=1}^{W} \sum_{k \in \{T,C,A,G\}} f_{jk} \log_2 \frac{f_{jk}}{g_k} \tag{1} $$

where $f_{jk}$ is the relative frequency of base k in column j of the motif, and $g_k$ is the background
frequency of base k (e.g. the genome-wide distribution of nucleotide bases). Throughout the
text we use the strength of a motif and its information content interchangeably to refer to
D(f,g) and I_seq.
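As an illustration of Eq. 1, the following minimal sketch computes the information content of a small probability matrix. The function name `information_content`, the column-of-dictionaries representation of the PM and the toy matrix are assumptions made for this example; the logarithm is taken in base 2 so that the result is in bits, and a term with zero frequency is taken to contribute zero.

```python
import math

def information_content(pm, background=None):
    """KL divergence D(f, g) in bits, summed over the columns of a probability matrix.

    `pm` is a list of columns; each column maps a base to its relative frequency.
    `background` maps a base to its background frequency (uniform if omitted).
    """
    if background is None:
        background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    total = 0.0
    for column in pm:
        for base, freq in column.items():
            if freq > 0.0:  # a zero-frequency base contributes nothing to the sum
                total += freq * math.log2(freq / background[base])
    return total

# Hypothetical 3-column PM: a conserved column (2 bits), a uniform column (0 bits)
# and a column split evenly between two bases (1 bit).
example_pm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
]
print(information_content(example_pm))  # 3.0
```

With a uniform background, each column contributes between 0 and 2 bits, so a motif of width W has an information content between 0 and 2W bits.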

Our main theoretical result is as follows. Consider the "one-occurrence-per-sequence"
motif-finding model where each of n sequences is assumed to have exactly one occurrence
of a motif of width W. The expected sequence length, L, in order to observe at least one motif
with a probability matrix (PM) diverged from the background, g, by at least D(f,g) is given
by:

$$ L \approx \frac{W \, 2^{D(f,g)}}{(n+1)^{W(|\mathcal{A}|-1)/n}} \tag{2} $$

where $|\mathcal{A}|$ is the cardinality of the alphabet $\mathcal{A}$, e.g. $|\mathcal{A}| = 4$ for DNA sequences. According to this
theorem, if the length of the DNA sequences is approximately equal to (or larger than) L, false-positive
motifs with information content D(f,g) can occur by chance. Please see Appendix A for the
proof of this theorem.
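The main result can be evaluated numerically. The sketch below assumes that Eq. 2 has the form L ≈ W 2^D / (n+1)^(W(|A|-1)/n), which is the form obtained at the end of the proof in Appendix A; the function name `expected_length` and the parameter values in the loop are illustrative choices, not values taken from the experiments.

```python
def expected_length(n, W, D, alphabet_size=4):
    """Sequence length L (assumed form of Eq. 2) at which chance motifs of strength D bits appear.

    n: number of sequences, W: motif width, D: information content in bits.
    """
    exponent = W * (alphabet_size - 1) / n
    return W * 2.0 ** D / (n + 1) ** exponent

# For a fixed motif strength, more sequences allow much longer sequences
# before false positives of that strength are expected.
for n in (10, 20, 30):
    print(n, round(expected_length(n, W=10, D=15.0)))
```

Reading the output for increasing n shows the trend discussed in the text: the length at which a motif of fixed strength is expected by chance grows quickly with the number of sequences.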

Figure 2 shows the expected sequence length, L, as a function of motif information content,
D(f,g), for DNA sequences with typical motif-finding parameters and W = 5, W = 10 and
W = 15, respectively. Each graph illustrates the L at which false-positive motifs with strength
D(f,g) are expected to occur by chance.

The dependency of false-positives on n is stronger than the dependency on L. As
an example, for motifs with W = 10 (Fig. 2b), a threefold increase of n (while keeping L
constant) reduces D(f,g) by the same amount as decreasing L by 2 orders of magnitude
(while keeping n unchanged). However, the dependency of false-positives on n decreases with
n and reaches a plateau for larger n, suggesting that in order to reduce the false-positive rate
only a sufficient number of sequences in the dataset is necessary (Fig. 3).

Finally, the false-positive information content, D(f,g), is approximately linear in W in the
range of interest (Fig. 4). Therefore, given a motif strength of interest, detecting real motifs
with smaller width is easier and less prone to false-positives.

MEME performance is consistent with the theoretical expectations: To confirm
our theoretical results, we conducted a set of experiments using the MEME software [11, 12],
because it implements the one-occurrence-per-sequence setup that we have treated theoretically
(see Methods for details of the experimental setup). We ran MEME on a set of randomly
generated sequences and asked MEME to report the most significant motif. The detected
motifs are therefore false-positives. The results from MEME are plotted in Fig. 2 and
consistently follow the theoretical predictions.

Simple rules of thumb for DNA motif-finding: The theoretical predictions provide
sequence lengths above which the observation of motifs with a given strength or less is more
probably due to chance than to any biological reason. Therefore, to reduce the false-positive
strength in experimental design, it is generally desirable to move towards weaker motifs (using
Eq. 2 or the curves in Fig. 2). We have the following rules of thumb for this purpose:

(1) As is intuitively expected, it is generally preferable to use shorter sequences (when this is
biologically plausible) to avoid unnecessary false-positives.

(2) Adding more sequences to the dataset reduces the false-positive rate considerably (e.g.
using 30 sequences compared to 10 reduces the false-positive motif strengths by more than
6 bits (25%) for W = 10, see Fig. 3). This effect, however, diminishes for larger n (e.g.
increasing n from 30 to 50 yields only a 2-bit reduction in motif strengths, see Fig. 3). This
suggests that in order to reduce the false-positive rate in "one-occurrence-per-sequence" motif
finding, only a "sufficient" number of sequences is needed.

(3) The dependency of false-positives (the strength of false-positive motifs) on L is weaker
than the dependency on n. Therefore, using many sequences (but not too many) is generally
preferred to using shorter sequences.

(4) Given n sequences of length L and a width W for potential motifs, Eq. 2 gives the expected
strength of false-positive motifs. Detected motifs that do not greatly exceed this expected
strength should be doubted, while motifs that are stronger than the expected value are
most probably not false-positives.

(5) Given a certain strength of interest, detection of motifs with smaller width is less prone
to false-positives and therefore easier.
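Rule (4) can be turned into a quick check by solving Eq. 2 for the information content. The sketch below assumes the form L ≈ W 2^D / (n+1)^(W(|A|-1)/n) for Eq. 2, which gives D = log2(L/W) + (W(|A|-1)/n) log2(n+1); it ignores the sampling-error correction of Eq. 4. The function names and the `margin_bits` parameter are hypothetical helpers introduced for this example, while the motif strengths and dataset sizes are those quoted for ZFP423 and the TATA-box in the text and in Fig. 5.

```python
import math

def expected_false_positive_strength(n, L, W, alphabet_size=4):
    """Solve the assumed form of Eq. 2 for D: strength (bits) of motifs expected by chance."""
    return math.log2(L / W) + W * (alphabet_size - 1) / n * math.log2(n + 1)

def looks_like_false_positive(D_observed, n, L, W, margin_bits=0.0):
    """Rule (4): doubt a motif whose strength does not exceed the chance level (plus a margin)."""
    return D_observed <= expected_false_positive_strength(n, L, W) + margin_bits

# Settings quoted in the text (W = 15): ZFP423 with n = 10, L = 1000;
# TATA-box with L = 100 and n = 20 or n = 30.
print(looks_like_false_positive(17.93, n=10, L=1000, W=15))
print(looks_like_false_positive(10.20, n=20, L=100, W=15))
print(looks_like_false_positive(10.20, n=30, L=100, W=15))
```

Under these assumptions the first two settings fall below the chance level and the third lies just above it, which agrees with the qualitative description of Fig. 5; the `margin_bits` argument lets a user demand that a motif "greatly exceed" the expected strength, as rule (4) recommends.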

Examples of applications: In using the theoretical results in Eq. 2 or the graphs in Fig.
2, it is generally desirable to move towards weaker motifs (towards the left on the graphs). To
illustrate this we chose the ZFP423 and the TATA-box motifs from the Jaspar database [13]
with D(f,g) = 17.93 and D(f,g) = 10.20, respectively. We show that it is difficult to detect
ZFP423 in sequences of length 1000, but it can be detected in shorter sequences (Fig. 5).
Similarly, we show that it is very difficult to detect the TATA-box using 20 sequences, but it
is possible if this is increased to 30 or if the motif is trimmed to include only the core positions
(Fig. 5).

Discussion 

Application to protein sequences: The theoretical analysis here can be applied directly
to motif-finding in sequences over different alphabets. In particular, the proposed equations can
be used for protein sequences by replacing $|\mathcal{A}| = 4$ with $|\mathcal{A}| = 20$, corresponding to the 20
amino-acid residues. It is easy to verify in Eq. 2 that with this modification, i.e. changing
$|\mathcal{A}| = 4$ to $|\mathcal{A}| = 20$, the expected length, L, increases exponentially. This suggests that, under equivalent
settings, the false-positive rate in protein motif finding is exponentially lower than in
DNA motif finding.

Extension to other motif-finding models: The method proposed here assumes the
"one-occurrence-per-sequence" model in motif finding (similar to the OOPS model in MEME [12]).
However, the analysis is extendable to other models by appropriately redefining the space
of all motifs in the dataset. See Appendix B for an extension of Eq. 2 to the case where each
sequence can carry either zero or one motif (similar to the ZOOPS model in MEME [12]).

A simple formula for computing the p-value: For a motif with a given PM f, the
p-value is defined as the probability of observing stronger motifs assuming that the sequences
are generated according to a background distribution.

There are different approaches for accurately computing the p-value [10, 14-17]. While
these approaches provide sophisticated methods that precisely compute the p-value, they
tend to be complicated to implement. Here, however, as a by-product of our main results,
we provide a simple equation that conservatively approximates the p-value.

Specifically, given n sequences, the p-value of a motif f with width W is no more than:

$$ \mathrm{pval} \le (n+1)^{W(|\mathcal{A}|-1)} \, 2^{-nD(f,g)} \tag{3} $$

Please see Appendix A for the details of the derivation of this equation.
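A minimal sketch of this bound follows, assuming Eq. 3 reads pval ≤ (n+1)^(W(|A|-1)) 2^(-nD). The function name `pvalue_upper_bound` is an assumption; the computation is done on a log2 scale to avoid overflow, and clipping the result at 1 is an addition made here because a probability cannot exceed 1.

```python
import math

def pvalue_upper_bound(n, W, D, alphabet_size=4):
    """Conservative p-value bound (assumed form of Eq. 3) for a motif of strength D bits."""
    # log2 of (n + 1)^(W(|A| - 1)) * 2^(-n D)
    log2_bound = W * (alphabet_size - 1) * math.log2(n + 1) - n * D
    if log2_bound >= 0:
        return 1.0  # the bound is vacuous here; clipping at 1 is our addition
    return 2.0 ** log2_bound

print(pvalue_upper_bound(n=30, W=10, D=10.0))  # a very small number
print(pvalue_upper_bound(n=30, W=10, D=2.0))   # 1.0: the bound carries no information
```

Because the bound is conservative, a small value is informative, whereas a value of 1 only means that the bound cannot rule out a chance occurrence.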
Methods 

Motif finding problem: The motif-finding problem considered here assumes the
"one-occurrence-per-sequence" model. It is assumed that there are n sequences of length L in the
data set (see Fig. 1 for the definition of the different parameters). The motifs are assumed to have
W columns with a probability matrix denoted by f. The motif's PM represents the relative
frequency of symbols (e.g. DNA bases) in each column of the motif. We measure the strength
of a motif by the divergence of its PM from a uniform background distribution g. We use the
Kullback-Leibler (KL) divergence, also referred to as biological information content [9, 10],
denoted by D(f,g) = I_seq(f,g) (see Eq. 1).

Correction of information content bias due to the sampling error: The theoretical
result in Eq. 2 is accurate for relatively large n. However, in practical applications, where the
number of sequences is relatively small, e.g. n < 15 for DNA sequences, a sampling error in
computing f causes a bias in the information content D(f,g). We account for this bias by
subtracting an approximate term suggested in [9] from the information content used in Eq. 2
as follows:

$$ D_{corrected}(f,g) \approx D(f,g) - \frac{|\mathcal{A}|-1}{2 \ln(2) \, n} W \tag{4} $$

where ln is the natural logarithm. The contribution of the sampling error vanishes as n increases.
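The size of this correction is easy to tabulate. The sketch below assumes that the correction term of Eq. 4 is (|A|-1) W / (2 ln(2) n) bits, subtracted from D(f,g); the function name and the example values are illustrative.

```python
import math

def corrected_information_content(D, n, W, alphabet_size=4):
    """Subtract the approximate small-sample bias (assumed form of Eq. 4) from D, in bits."""
    bias = (alphabet_size - 1) * W / (2.0 * math.log(2) * n)
    return D - bias

# The bias shrinks in proportion to 1/n.
for n in (10, 30, 100):
    print(n, round(10.0 - corrected_information_content(10.0, n, W=10), 3))
```

For a motif of width 10, the assumed bias is a little over 2 bits at n = 10 and drops by a factor of ten at n = 100, which is why the correction matters mainly for small datasets.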

Simulations: In each experiment, we generated a set of n sequences of length L drawn
from a uniform background distribution g = [0.25 0.25 0.25 0.25]. We then ran MEME
using the OOPS model (only one motif per sequence) and restricted MEME to report only one
motif (the most significant) of width W. We repeated the experiment for different numbers
of sequences (n = {10, 20, 30}), different motif widths (W = {5, 10, 15}), and different sequence
lengths (L = {50, 100, 500, 1000, 5000}). We repeated each experiment for 50 Monte-Carlo runs,
resulting in 50 data points for each experiment.

For each detected motif, we computed the information content or divergence, D(f,g),
using the PMs reported by MEME. Since the input to MEME is a set of random sequences,
all detected motifs are false-positives by construction. We then compared the false-positives
detected by MEME with the theoretical predictions. Each motif detected by MEME is depicted
in the figures by a star (*).
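The input side of such a simulation can be sketched as follows. This is not the script used for the experiments: the function names, the seed handling and the FASTA record names are assumptions, and the call to the motif finder itself is deliberately left out.

```python
import random

def random_sequences(n, L, seed=0, alphabet="ACGT"):
    """Draw n sequences of length L i.i.d. from a uniform background over `alphabet`."""
    rng = random.Random(seed)  # a fixed seed makes each Monte-Carlo run reproducible
    return ["".join(rng.choice(alphabet) for _ in range(L)) for _ in range(n)]

def to_fasta(sequences):
    """Format sequences as FASTA text, a common input format for motif finders."""
    return "".join(f">seq{i + 1}\n{s}\n" for i, s in enumerate(sequences))

# One dataset per Monte-Carlo run: vary the seed to obtain independent replicates.
seqs = random_sequences(n=10, L=50, seed=1)
fasta_text = to_fasta(seqs)
print(fasta_text.splitlines()[0])  # >seq1
```

Each replicate would then be passed to the motif finder, and the information content of the reported motif computed from its probability matrix.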
Figure legends:

Figure 1. DNA motif finding problem parameters. In this example, n = 5 sequences of
length L = 80 are used to detect a motif of width W = 15. The corresponding probability matrix,
f, is also shown; it represents the relative frequency of nucleotides in each column of the
motif. Note that each sequence has only one occurrence of the motif (hence the
one-occurrence-per-sequence (OOPS) model).

Figure 2. Theoretical results compared to MEME simulations. Theoretical prediction
of the expected sequence length, L, to observe false-positive motifs with information content
D(f,g) (solid lines) compared to experimental results of MEME (stars) for three different
motif widths, W = 5, 10, and 15. The results are for three different numbers of sequences,
n = {10, 20, 30}, in the dataset. Each set of experiments is repeated for 50 Monte-Carlo
runs (so there are 50 stars (*) for each set of experiments). The range of information content
is chosen between 0 and 40 bits, corresponding to what we found for motifs in the Jaspar
database [13] (see Supplementary Fig. 6, which shows the frequency of motifs with respect to
their corresponding information content). For any given n, decreasing L reduces the strength
of false-positive motifs. Alternatively, for a fixed L, adding more sequences (increasing n)
reduces the false-positive strength. The dependency of motif strength on n is stronger than
the dependency on L. For instance, for motifs with W = 10 in (b), a threefold increase
of n (while keeping L constant) reduces D(f,g) by the same amount as decreasing L by 2
orders of magnitude (while keeping n unchanged).

Figure 3. False-positive information content versus the number of sequences. The
dependency of the false-positive strength on n diminishes with increasing n and reaches a plateau,
suggesting that it is not necessary to use too many sequences to maintain an acceptable level
of false-positives. In this figure, the sequence length is fixed to L = 1000. Simulation results
from MEME are shown by blue stars (*). There are 50 simulation results (50 stars) for each value of
n. Simulations are done for n = {10, 20, 30, 50, 100}.

Figure 4. False-positive information content versus the motif width. The information
content of false-positive motifs, D(f,g), is shown with respect to the motif width for a fixed L
and n. For the range of motif widths of interest (5 to 20), the information content is
approximately linear in W. Given a motif strength of interest, detecting real motifs with
smaller width is easier and less prone to false-positives (i.e. for a given motif strength, shorter
false-positive motifs are rarer). In this figure, the sequence length and the number of sequences
are fixed to L = 1000 and n = 30, respectively. The theoretical predictions are shown by a solid
line. The experimental results from MEME are shown by (*). There are 50 repeated results for each
W = {5, 10, 15}.

Figure 5. Examples of applications. Two real motifs are used to show the application
of the theoretical predictions (here the motif width is W = 15). Motifs as strong as ZFP423 [13]
in n = 10 sequences of length L = 1000 will be buried in false-positives. Therefore, in order
to avoid such false-positive motifs, one can reduce L (along Arrow-2) or preferably add more
sequences (along Arrow-1) to the dataset. Similarly, it would be very difficult to identify a
motif such as the TATA-box motif in a set of 20 sequences of length L = 100 due to
false-positives. Since using shorter sequences is unlikely, one can increase the number of sequences to
n = 30 (along Arrow-3) to avoid false-positives that have the same strength as the TATA-box.
It is interesting to know how strong the false-positive motifs are for motifs with information
content equal to that of the TATA-box but with a width W = 5 (this is equivalent to trimming all
but the core bases of the TATA-box). Fig. 4 shows that this is equivalent to moving along the
theoretical curve from W = 15 to W = 5, which reduces the false-positive strength enough to
detect this motif.
Supplementary Information

1. Supplementary figure

Fig. 6. Distribution of motifs in the Jaspar database [13] with respect to their information content, D(f,g). The
information content is computed using the probability matrix of motifs provided by the database, denoted by
f, and assuming a uniform background distribution of nucleotides, g. Graphs in Fig. 2 are prepared such that
they cover the range of information content of motifs found in this database.




Appendix A. Proof of Theorems 

Here we provide a series of definitions and lemmas that will be used to prove the main theorem. 
The proofs of the lemmas used to prove the main theorem are adapted mainly from [8] with
minor changes to apply to the motif-finding problem. The outline of the proof is as follows:

• We assume that a set of n sequences with length L is generated by a background (nu- 
cleotide) distribution g. 

• Using a sliding window of width W we form the motif dataset. 

• We define a divergence function that measures the strength of motifs using their proba- 
bility matrix (PM). 

• We then compute the probability of observing motifs with a given PM. 

• By adding the probabilities of all motifs with a stronger PM than the given motif we compute
the p-value of the motif.

• We then use the p-value to derive the expected size of the dataset and prove the main 
theorem. 
2. Preliminaries

Let us denote by $Y = [Y_1, Y_2, \cdots, Y_n]^T$ the set of n sequences of length L used in a typical
motif-finding problem (T denotes the transpose of a matrix):

$$ Y = \begin{bmatrix}
y_{11} & y_{12} & y_{13} & \cdots & y_{1(L-1)} & y_{1L} \\
y_{21} & y_{22} & y_{23} & \cdots & y_{2(L-1)} & y_{2L} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
y_{n1} & y_{n2} & y_{n3} & \cdots & y_{n(L-1)} & y_{nL}
\end{bmatrix} \tag{A.1} $$

Note that each $Y_i$ is a row vector of L DNA bases or amino-acid residues. In presenting our
analysis we consider DNA sequences with the alphabet $\mathcal{A}$ = {A, T, G, C}. The alphabet size,
denoted by $|\mathcal{A}|$, is in this case equal to 4. However, the theoretical results are directly
applicable to any other alphabet size, including $|\mathcal{A}| = 20$ for protein sequences.

Motif-finding algorithms seek to find a set of over- or under-represented short subsequences
in Y. To prepare the dataset for motif finding, we slide a window of length W along each $Y_i$,
shifting by one base at a time, to obtain (L - W + 1) subsequences of length W. We then
arrange n such subsequences, one from each $Y_i$, to form a motif X as follows:

$$ X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1(W-1)} & x_{1W} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2(W-1)} & x_{2W} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{n(W-1)} & x_{nW}
\end{bmatrix} \tag{A.2} $$

Each X is a potential motif. This arrangement is based on the "one-occurrence-per-sequence"
model in motif finding, where each sequence $Y_i$ contributes one and only one subsequence to the
motif.

We denote by $\mathcal{X}$ the set of all motifs X. The size of this set is $|\mathcal{X}| = (L - W + 1)^n$.
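The construction of the motif set can be made concrete on a toy dataset. In the sketch below, the function names and the three short sequences are hypothetical; exhaustive enumeration is only feasible for tiny n and L, which is exactly why the size of the set matters.

```python
from itertools import product

def windows(sequence, W):
    """All L - W + 1 subsequences of width W, shifting by one base at a time."""
    return [sequence[i:i + W] for i in range(len(sequence) - W + 1)]

def candidate_motifs(sequences, W):
    """Every motif X: one window from each sequence (one-occurrence-per-sequence)."""
    return product(*(windows(s, W) for s in sequences))

toy = ["ACGTAC", "TTGACA", "GGCATA"]  # n = 3 sequences of length L = 6
motifs = list(candidate_motifs(toy, W=4))
print(len(motifs))  # (6 - 4 + 1) ** 3 = 27
```

Already for n = 10 and L = 1000 the set contains on the order of 10^30 candidate motifs, so some of them are bound to look strong by chance.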

The search for statistically significant motifs, in essence, involves finding $X \in \mathcal{X}$ that is
distributed differently from a background distribution g (e.g. the genome-wide distribution of DNA
bases, which is commonly considered to be uniform). To do so, we represent the motif
X by a probability matrix f defined as:

$$ f = \begin{bmatrix}
f_{1T} & f_{2T} & f_{3T} & \cdots & f_{W-1,T} & f_{WT} \\
f_{1C} & f_{2C} & f_{3C} & \cdots & f_{W-1,C} & f_{WC} \\
f_{1A} & f_{2A} & f_{3A} & \cdots & f_{W-1,A} & f_{WA} \\
f_{1G} & f_{2G} & f_{3G} & \cdots & f_{W-1,G} & f_{WG}
\end{bmatrix} \tag{A.3} $$

where, e.g., $f_{jT}$ denotes the relative frequency of the symbol T in column j of the
sub-alignment X. The PM f represents the empirical distribution of DNA bases at each column
of X. For sequences over a different alphabet, e.g. protein sequences, the PM is defined with 20
rows corresponding to the number of amino-acid residues.
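The probability matrix of a candidate motif follows directly from column counts. The sketch below is illustrative: the function name, the row order "TCAG" (chosen to match the order used in the definition of the information content) and the toy alignment are assumptions.

```python
from collections import Counter

def probability_matrix(motif, alphabet="TCAG"):
    """Column-wise relative frequencies of a motif given as n equal-length strings."""
    n = len(motif)
    pm = []
    for column in zip(*motif):  # iterate over the W columns of the alignment
        counts = Counter(column)
        pm.append({base: counts[base] / n for base in alphabet})
    return pm

X = ["ACGT", "ACGA", "ACTT", "AGGT"]  # hypothetical motif with n = 4, W = 4
f = probability_matrix(X)
print(f[0])  # {'T': 0.0, 'C': 0.0, 'A': 1.0, 'G': 0.0}
```

Each column of the result sums to 1, and every entry is a multiple of 1/n, a fact that is used later when counting the number of possible PMs.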

Motifs represent the abundance of a particular set of similarly composed short sequences
in the set Y, a property that is commonly associated with biological importance [1, 6, 10]. To
quantify the biological importance of a motif we use the information content measure [6, 9], which
is defined as the divergence of the PM of a motif from a background distribution. Specifically,
for a motif with a PM f, the divergence from a background distribution g is defined as the
Kullback-Leibler (KL) distance between f and g [6, 10] as follows:

$$ D(f,g) = I_{seq}(f,g) = \sum_{j=1}^{W} \sum_{k \in \{T,C,A,G\}} f_{jk} \log_2 \frac{f_{jk}}{g_k} $$

where $f_{jk}$ is defined in (A.3) and $g_k$ is the background frequency of base k.

The divergence D(f,g), also known as the biological information content of the motif [9],
is in fact the expected log-likelihood ratio of the motif given a background distribution g [10].
Throughout the manuscript, we refer to a motif X by its PM f. We also use the strength of
a motif and its information content interchangeably for the divergence D(f,g) or I_seq.

3. Probability of a motif

The probability of a motif X under the background distribution g can be written in terms of
its PM using the following lemma*:

Lemma 3.1. If a motif X is drawn i.i.d. according to g, the probability of X under g, denoted
by $P_g$ throughout the manuscript, depends only on its PM f and is given by:

$$ P_g(X) = 2^{-n(H(f) + D(f,g))} \tag{A.4} $$

where H(f) is the entropy of f (in bits), defined as follows:

$$ H(f) = -\sum_{j=1}^{W} \sum_{k \in \{T,C,A,G\}} f_{jk} \log_2 f_{jk} $$

and D(f,g) is defined in Eq. 1.

Proof: See ([8], page 281).

□

One can compute the probability of a motif X using this lemma. However, in order to compute
the probability of observing all motifs that have a PM f, we need to add the probabilities
of all such motifs, as in the following.
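Lemma 3.1 can be checked numerically on a small example by computing the probability of a motif in two ways: directly, as a product over its symbols, and through its PM. The sketch below works with log2 probabilities; the function names, the non-uniform background and the toy motif are assumptions chosen for illustration, and the entropy is taken with the usual negative sign.

```python
import math
from collections import Counter

def log2_prob_direct(motif, g):
    """log2 of P_g(X): every symbol of X is drawn independently from g."""
    return sum(math.log2(g[s]) for row in motif for s in row)

def log2_prob_from_pm(motif, g):
    """log2 of P_g(X) via Lemma 3.1: -n * (H(f) + D(f, g))."""
    n = len(motif)
    H = D = 0.0
    for column in zip(*motif):
        for base, count in Counter(column).items():
            f = count / n                      # entry of the PM for this column
            H -= f * math.log2(f)              # entropy term
            D += f * math.log2(f / g[base])    # KL divergence term
    return -n * (H + D)

g = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}   # hypothetical background
X = ["ACGT", "ACGA", "ACTT", "AGGT"]
print(log2_prob_direct(X, g), log2_prob_from_pm(X, g))
```

The two printed values agree up to floating-point rounding, which reflects the fact that the probability of a motif depends on the motif only through its PM.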

*The PM is the empirical distribution of the motif X. In information theory, the empirical distribution is
commonly referred to as the type of X. The discussion presented here is part of the Method of Types [18], which
studies statistical properties of sequences based on their types.

4. Class of a probability matrix and its probability

Let us define the set of all X's that have the same PM f, commonly referred to as the class
of the PM f, as follows:

$$ T(f) \triangleq \{ X \in \mathcal{X} \mid \mathrm{pm}(X) = f \} \tag{A.5} $$

If we count the number of motifs in this class and add their corresponding probabilities using
(A.4), we can compute the probability of all X's with PM f. For this purpose, we use the
following lemma, which gives the size of the class of a PM f:

Lemma 4.1. The size of the type class of f is upper-bounded as follows:

$$ |T(f)| \le 2^{nH(f)} \tag{A.6} $$

Proof: See ([8], page 282).

□

It can be seen from (A.6) that as f changes in such a way that it has a larger entropy (e.g.
as it gets closer to a uniform background distribution g), the total number
of motifs X with PM f becomes exponentially large. Alternatively, when f is such that its
entropy is lower, i.e. it is a highly skewed PM, the number of motifs with a PM equal to f
becomes exponentially small.
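The bound of Lemma 4.1 can be compared with an exact count on a toy motif. The sketch below counts, column by column, the arrangements of n symbols that share the same column frequencies (a multinomial coefficient), without restricting them to windows that actually occur in a dataset; the function names and the toy motif are assumptions.

```python
import math
from collections import Counter

def type_class_size(motif):
    """Number of n-by-W symbol arrangements sharing the PM of `motif`."""
    n = len(motif)
    size = 1
    for column in zip(*motif):
        ways = math.factorial(n)
        for count in Counter(column).values():
            ways //= math.factorial(count)  # multinomial coefficient of the column
        size *= ways
    return size

def entropy_bits(motif):
    """H(f): entropy of each column of the PM, summed over the columns."""
    n = len(motif)
    return -sum((c / n) * math.log2(c / n)
                for column in zip(*motif) for c in Counter(column).values())

X = ["ACGT", "ACGA", "ACTT", "AGGT"]
exact = type_class_size(X)
bound = 2 ** (len(X) * entropy_bits(X))
print(exact, bound)
```

For this motif the exact count is 64, well below the bound, and a fully conserved motif (zero entropy) gives a class of size 1, matching the discussion of skewed PMs above.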

Now, to compute the probability of observing motifs with a PM f, one can add up the
probabilities of all $X \in T(f)$ as follows:

Lemma 4.2. If motifs X are drawn i.i.d. according to a distribution g, the probability of
observing motifs that all have a PM f is upper-bounded as follows:

$$ P_g(T(f)) \le 2^{-nD(f,g)} \tag{A.7} $$

Proof: The probability of a class T(f) can be written as:

$$
\begin{aligned}
P_g(T(f)) &= \sum_{X \in T(f)} P_g(X) \\
&= \sum_{X \in T(f)} 2^{-n(D(f,g) + H(f))} && \text{(A.8)} \\
&\le 2^{nH(f)} \, 2^{-n(D(f,g) + H(f))} && \text{(A.9)} \\
&= 2^{-nD(f,g)}
\end{aligned}
$$

where in (A.8) we used (A.4) of Lemma 3.1 and in (A.9) we used (A.6) of Lemma 4.1.

□

According to this lemma, the probability of observing motifs with a PM f decays
exponentially with the divergence of f from g. Therefore, as f gets closer (in the KL divergence sense)
to g, the bound on the probability of observing such motifs approaches 1. On the other hand, the
probability of strong motifs with large divergence from the background, i.e. larger D(f,g), is
exponentially small.

We can now compute the probability of motifs with PM f. By adding up the probabilities
of motifs with stronger PMs we can compute the p-value of a motif, as shown in the following. Before
that, we need to count all possible PMs.

5. Number of possible PMs

Enumerating all PMs is impractical for larger n. Instead, in the following we derive a bound
on the number of possible PMs.

First, let us consider only one column of a motif X in (A.2) with a PM as in (A.3). There
are n sequences in the motif. Therefore, the column has n symbols chosen from the alphabet
$\mathcal{A}$. One can enumerate all possible distributions of bases in this column:

$$ f_j = \left[ \frac{n_T}{n}, \frac{n_C}{n}, \frac{n_A}{n}, \frac{n_G}{n} \right], \qquad n_k \in \{0, 1, \ldots, n\}, \quad \sum_{k} n_k = n $$

where $n_k$ is the number of occurrences of base k in the column. It can be seen that the
numerators of the frequencies range from 0 to n. Furthermore, there
are three independent frequencies in this PM, i.e. the last one is fixed by the rest so that they
sum to 1. Therefore, there are about $(n + 1)^3$ different possible arrangements of these
frequencies. We formalize this idea for an extended number of columns, W, in the following
lemma [8]:

Lemma 5.1. For a motif of width W, the set $\mathcal{P}$ of possible PMs satisfies $|\mathcal{P}| \le (n+1)^{W(|\mathcal{A}|-1)}$.

Proof: There are $|\mathcal{A}| - 1$ free components in the PM of any column (the last component is
fixed by the others). The numerator of each component can take n + 1 values. Therefore,
each column can have at most $(n+1)^{|\mathcal{A}|-1}$ PMs. Since the columns are independently and identically
distributed, there are at most $(n+1)^{W(|\mathcal{A}|-1)}$ different PMs for a motif of width W.

□
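The counting argument can be checked against the exact number of column distributions, which is the number of ways to split n into $|\mathcal{A}|$ non-negative counts (a standard stars-and-bars count, used here as an aside and not taken from the text). The function names are illustrative.

```python
from math import comb

def column_type_count(n, alphabet_size=4):
    """Exact number of distinct frequency vectors for one column of n symbols."""
    return comb(n + alphabet_size - 1, alphabet_size - 1)

def pm_count_bound(n, W, alphabet_size=4):
    """Bound of Lemma 5.1 on the number of PMs of width W: (n + 1)^(W(|A| - 1))."""
    return (n + 1) ** (W * (alphabet_size - 1))

for n in (10, 20, 30):
    exact = column_type_count(n) ** 5  # exact count for W = 5 independent columns
    print(n, exact <= pm_count_bound(n, W=5))
```

For n = 10 a single DNA column has 286 possible distributions, compared with the bound of 1331, so the bound is loose but grows only polynomially in n, which is what the subsequent p-value argument requires.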



6. An approximate value for the p-value

By bounding the maximum number of PMs for a motif of width W and knowing the probability
of the class of each PM (Lemma 4.2), we can now compute the p-value of a motif with PM f.
The main idea, as explained before, is to first define the set of all motifs with PMs at least as
strong as f, i.e. with D ≥ D(f,g), and then to use Lemma 4.2 to compute its probability. This idea is
formalized in the following lemma, which is a form of Sanov's Theorem. Here we provide a simplified
version of the proof that is only applicable to our case. Interested readers are referred to ([8],
page 292) for the general theorem and technical details.

Lemma 6.1. Given that a set $\mathcal{X}$ is generated according to a background distribution g, the
probability of observing motifs X with PMs that diverge from the background by at least
D(f,g) is upper-bounded by:

$$ P_g(X) \le (n+1)^{W(|\mathcal{A}|-1)} \, 2^{-nD(f,g)} \tag{A.10} $$

where $P_g$ is the probability under the background distribution g.

Proof: We denote by $\mathcal{E}(f)$ the set of all motifs X that have a PM h that diverges from
g by at least D(f,g):

$$ \mathcal{E}(f) \triangleq \{ X \in \mathcal{X} \mid \mathrm{pm}(X) = h, \ D(h,g) \ge D(f,g) \} \tag{A.11} $$

By definition, the probability of the set $\mathcal{E}$ is the p-value of a motif with PM f. The probability
of the set $\mathcal{E}$ is equal to the sum of the probabilities of the classes of PMs in $\mathcal{E}$. We have:

$$
\begin{aligned}
P_g(\mathcal{E}) &= \sum_{h \in \mathcal{E}} P_g(T(h)) && \text{(A.12)} \\
&\le \sum_{h \in \mathcal{E}} 2^{-nD(h,g)} && \text{(A.13)} \\
&\le \sum_{h \in \mathcal{E}} \max_{h \in \mathcal{E}} 2^{-nD(h,g)} && \text{(A.14)} \\
&= \sum_{h \in \mathcal{E}} 2^{-n \min_{h \in \mathcal{E}} D(h,g)} && \text{(A.15)} \\
&= \sum_{h \in \mathcal{E}} 2^{-nD(f,g)} && \text{(A.16)} \\
&= 2^{-nD(f,g)} \sum_{h \in \mathcal{E}} 1 && \text{(A.17)} \\
&\le 2^{-nD(f,g)} \, (n+1)^{W(|\mathcal{A}|-1)} && \text{(A.18)}
\end{aligned}
$$

In (A.12) we used the fact that, by definition, the probability of the set $\mathcal{E}$ is the sum of the
probabilities of the classes of PMs in $\mathcal{E}$. In (A.13) we used (A.7) of Lemma 4.2, which gives
an upper bound on the probability of the class of a PM h. Inequality (A.14) is valid because
we replace every $2^{-nD(h,g)}$ in the summation with its maximum value. Equivalently, in (A.15)
we replace the exponent with its minimum. By the definition of the set $\mathcal{E}$ in
(A.11), all its PMs, i.e. all $h \in \mathcal{E}$, have a divergence not less than D(f,g). Therefore, we
have $\min_{h \in \mathcal{E}} D(h,g) = D(f,g)$ in (A.16). It can be seen in (A.17) that $2^{-nD(f,g)}$ is
independent of the summation index and can therefore be taken out. In (A.18) we bound the
summation by the number of its terms, which is at most the total number of possible PMs
given by Lemma 5.1.

□ 

This lemma provides an approximate equation (in fact an upper bound) for the p-value
of a motif with PM f, as presented in Eq. 3.

7. Proof of the main theorem (Eq. 2)

Theorem 7.1. Given a set Y of n sequences of symbols from an alphabet $\mathcal{A}$, the expected
sequence length, L, in order to observe at least one motif of width W with a PM diverged
from the background by at least D(f,g) is given by:

$$ L \approx \frac{W \, 2^{D(f,g)}}{\left( (n+1)^{W(|\mathcal{A}|-1)} \right)^{1/n}} \tag{A.19} $$

where $|\mathcal{A}|$ is the cardinality of the set $\mathcal{A}$, e.g. $|\mathcal{A}| = 4$ for DNA sequences with $\mathcal{A}$ = {A, T, C, G}.
This is an approximate lower bound on the expected length.

Proof. Lemma 6.1 gives an upper bound on the probability of observing motifs with a
type that diverges from the background by at least D(f,g). This probability, when multiplied by the
total number of motifs X in the set $\mathcal{X}$, gives an upper bound on the number of such motifs
observed, as derived in the following.

Note that the total number of X's in the data set $\mathcal{X}$ is equal to $|\mathcal{X}| = (L - W + 1)^n$.
However, it can be easily verified that for large L, each X overlaps with at least 2
neighboring X's due to the one-shift-at-a-time sliding window. This results in approximately
$|\mathcal{X}| \approx \left( (L - W + 1)/W \right)^n$ effectively independent $X \in \mathcal{X}$. Therefore, the expected number of
observations of $X \in \mathcal{E}(f)$, denoted by $N_f$, is approximately:

$$ N_f \approx |\mathcal{X}| \, P_g(\mathcal{E}(f)) \le \left( \frac{L - W + 1}{W} \right)^n (n+1)^{W(|\mathcal{A}|-1)} \, 2^{-nD(f,g)} \tag{A.20} $$

This is in fact an upper bound on the expected number of motifs observed.

By letting $1 \le N_f$, the minimum expected length to observe at least one X with pm(X) = f
becomes:

$$ L \ge W - 1 + W \left( \frac{2^{nD(f,g)}}{(n+1)^{W(|\mathcal{A}|-1)}} \right)^{1/n} $$

This reduces to (A.19), where we used the fact that in motif-finding problems we have $L \gg W - 1$. □



Appendix B. Extension to other motif-finding models 

The method proposed here assumes the "one-occurrence-per-sequence" model in motif finding
(similar to the OOPS model in MEME [11]). However, the analysis is extendable to other
models by appropriately redefining the space of all motifs in the dataset. For instance, in
cases where each sequence can carry either zero or one motif (similar to the ZOOPS model in
MEME [11]), the following equation provides the expected length L:

$$ L \approx \frac{W \, 2^{rD(f,g)}}{(rn+1)^{W(|\mathcal{A}|-1)/n}} $$

where r ≤ 1 is the fraction of sequences that carry a motif (note that this equation simplifies
to Eq. 2 for r = 1). In this equation, the denominator is always larger than 1. Therefore, the
expected length is reduced significantly compared to the OOPS model, suggesting a potentially
higher rate of false-positives in ZOOPS models.
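The effect of the fraction r can be explored numerically. The sketch below assumes the ZOOPS expression has the form L ≈ W 2^(rD) / (rn+1)^(W(|A|-1)/n), which reduces to the OOPS form for r = 1; the function name and the parameter values are illustrative.

```python
def expected_length_zoops(n, W, D, r=1.0, alphabet_size=4):
    """Expected length (assumed Appendix B form) when a fraction r of n sequences carry the motif."""
    return W * 2.0 ** (r * D) / (r * n + 1) ** (W * (alphabet_size - 1) / n)

# With r = 1 the OOPS value is recovered; smaller r shortens the expected length.
for r in (1.0, 0.75, 0.5):
    print(r, round(expected_length_zoops(n=30, W=10, D=10.0, r=r), 1))
```

In this example the expected length drops from a few hundred bases at r = 1 to 20 bases at r = 0.5, illustrating why a higher rate of false-positives is expected when sequences are allowed to lack the motif.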